
Conversation

@ChuxiJ (Contributor) commented Feb 10, 2026

Summary

Add a comprehensive GPU tier configuration system and boundary testing framework to determine minimum VRAM requirements for different optimization levels.

GPU Tier System (acestep/gpu_config.py)

  • Add GPUConfig dataclass with per-tier settings for quantization, offload, LM models, batch size, and duration limits
  • Implement automatic GPU tier detection based on available VRAM
  • Support VRAM simulation via MAX_CUDA_VRAM environment variable with hard VRAM cap enforcement using torch memory fraction
  • Define tiers: tier1(4GB), tier2(6GB), tier3(8GB), tier4(12GB), tier5(14GB), tier6a(16GB), tier6b(24GB), unlimited(48GB+)
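
The following is a minimal sketch, under simplifying assumptions, of how the tier table, detection, and VRAM cap fit together: a trimmed-down GPUConfig, detection that honors the MAX_CUDA_VRAM override, and the hard cap via torch's per-process memory fraction. Field names and the exact tier boundaries are abbreviated here; the real acestep/gpu_config.py defines more fields (recommended LM model, backend restriction, compile defaults) and the authoritative thresholds.

```python
# Illustrative sketch only; simplified from the PR description.
import os
from dataclasses import dataclass, field
from typing import List

import torch


@dataclass
class GPUConfig:
    tier: str
    gpu_memory_gb: float
    quantization_default: bool = True      # INT8 quantization on by default
    offload_to_cpu_default: bool = True    # CPU offload on by default
    available_lm_models: List[str] = field(default_factory=list)
    max_batch_size_with_lm: int = 1
    max_duration_with_lm: float = 240.0


def get_gpu_memory_gb() -> float:
    """Total VRAM in GB, honoring the MAX_CUDA_VRAM simulation override."""
    simulated = os.environ.get("MAX_CUDA_VRAM")
    if simulated:
        cap_gb = float(simulated)
        if torch.cuda.is_available():
            total = torch.cuda.get_device_properties(0).total_memory / 1024**3
            # Hard cap: enforce the simulated budget as a memory fraction
            torch.cuda.set_per_process_memory_fraction(min(1.0, cap_gb / total))
        return cap_gb
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


def get_gpu_tier(vram_gb: float) -> str:
    """Map detected VRAM to a tier name (boundaries simplified here)."""
    for limit, name in [(4, "tier1"), (6, "tier2"), (8, "tier3"),
                        (12, "tier4"), (14, "tier5"), (16, "tier6a"),
                        (24, "tier6b")]:
        if vram_gb <= limit:
            return name
    return "unlimited"
```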

Boundary Testing (profile_inference.py)

  • Add --tier-boundary flag to tier-test mode for automated boundary analysis across all VRAM tiers
  • Refactor tier test logic into reusable _run_single_tier_test()
  • Test three variants per tier: default, no-quant, no-offload
  • Smart skipping when tier already disables the tested optimization
  • Add _print_boundary_summary() with clear results table
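
Invoked via `python profile_inference.py --mode tier-test --tier-boundary`, the sweep can be pictured roughly as below. This is a sketch driven by the tier's quantization_default and offload_to_cpu_default flags; the wrapper name run_boundary_variants and the reduced argument list are illustrative, while the real loop passes several more arguments to _run_single_tier_test.

```python
# Sketch of the per-tier boundary sweep; not the full profile_inference.py code.
def run_boundary_variants(sim_gb, gpu_config, run_single_tier_test):
    results = []

    # Variant 1: the tier's default settings.
    results.append(run_single_tier_test(sim_gb, gpu_config, test_variant="default"))

    # Variant 2: disable INT8 quantization, keep offload as configured.
    if gpu_config.quantization_default:
        results.append(run_single_tier_test(
            sim_gb, gpu_config, quantization_override=None, test_variant="no-quant"))
    # else: the tier already runs without quantization, nothing new to test.

    # Variant 3: no quantization AND no CPU offload.
    if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
        results.append(run_single_tier_test(
            sim_gb, gpu_config,
            offload_override=False, quantization_override=None,
            test_variant="no-offload"))
    # else: the tier already has both disabled, so the default run covers it.

    return results
```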

Boundary Test Results

  • No INT8 Quantization: minimum tier2 (6GB), peak 4.91GB
  • No CPU Offload: minimum tier3 (8GB), peak 7.30GB

Handler & UI Updates

  • Enhanced model offload/load context management in handler.py
  • Updated Gradio UI to expose GPU tier settings
  • Updated API server for tier-aware configuration
  • Improved nano-vllm model runner compatibility

Documentation

  • Updated GPU_COMPATIBILITY docs (en/zh/ja/ko)
  • Updated BENCHMARK docs (en/zh) with tier-boundary CLI reference
  • Updated INFERENCE, INSTALL, GRADIO_GUIDE docs across all languages
  • Updated README with GPU tier information

Summary by CodeRabbit

  • New Features

    • UI now auto-selects recommended LM, backend, offload, quantization and updates audio duration/batch sliders by detected GPU tier (including tier-aware recommendations).
  • Improvements

    • New multi-column GPU compatibility guidance (more granular VRAM bands, Backend/Notes, Tier 6a/6b).
    • Tier-test profiling mode and VRAM profiling utility for automated tier/boundary validation.
    • Raised auto-offload threshold (~20GB) and safer LM downgrade for very-low-VRAM GPUs.
    • Enhanced VRAM guard: adaptive chunk sizing, batch clamping, and CPU fallback for VAE decode.
  • Documentation

    • Expanded localized docs (EN/JA/KO/ZH) with tier-aware defaults and testing guides.
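
For the "Enhanced VRAM guard" item above, here is a minimal sketch of the VAE-decode fallback cascade (tiled GPU decode, shrinking chunks on OOM, then full CPU decode as a last resort). The function name and the vae.decode call shape are assumptions for illustration; the actual handler.py logic also consults free VRAM, the ACESTEP_VAE_ON_CPU override, and VAE_DECODE_MAX_CHUNK_SIZE.

```python
# Rough sketch of the fallback cascade; details differ in the real handler.py.
import torch


def decode_latents_with_fallback(vae, latents, chunk_size):
    """Try tiled GPU decode, shrink chunks on OOM, then fall back to CPU."""
    while chunk_size >= 1:
        try:
            chunks = [vae.decode(latents[:, i:i + chunk_size])
                      for i in range(0, latents.shape[1], chunk_size)]
            return torch.cat(chunks, dim=1)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            chunk_size //= 2  # adaptive chunk sizing: retry with smaller tiles
    # Last resort: decode entirely on CPU (slow but memory-safe)
    return vae.to("cpu").decode(latents.to("cpu"))
```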


coderabbitai bot commented Feb 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

📝 Walkthrough

Adds extensive VRAM-aware GPU tiering, adaptive GPUConfig defaults (LM, backend, offload, quantization), LM selection/downgrade rules, VAE/decoding VRAM fallbacks, VRAM simulation and profiling tools (tier-test, profile_vram), Gradio UI tier-aware defaults/limits, and broad docs updates. No model algorithm changes.

Changes

  • GPU Configuration Core (acestep/gpu_config.py): Major expansion: new VRAM constants (VRAM_AUTO_OFFLOAD_THRESHOLD_GB), tier split (tier6a/tier6b) and alias, extended GPUConfig fields (recommended_lm_model, lm_backend_restriction, recommended_backend, offload_/quantization_/compile_ defaults), VRAM accounting helpers, adaptive-config calculation, LM-size selection helpers, and debug/MAX_CUDA_VRAM support.
  • Pipeline & API Integration (acestep/acestep_v15_pipeline.py, acestep/api_server.py): Replace hard-coded 16GB checks with VRAM_AUTO_OFFLOAD_THRESHOLD_GB; add a 4B→1.7B downgrade when GPU memory is below the threshold; surface the new constant via imports and update messages.
  • UI Integration (acestep/gradio_ui/interfaces/generation.py, acestep/gradio_ui/events/generation_handlers.py, acestep/gradio_ui/events/__init__.py): Tier-aware UI defaults and constraints: available LM models/backends filtered by tier, recommended model resolution from disk, backend restrictions, dynamic duration/batch updates after init, and enriched user info/status messages.
  • Memory Management & Safety (acestep/handler.py): VAE_DECODE_MAX_CHUNK_SIZE, VRAM-aware auto chunk sizing, _vram_guard_reduce_batch to auto-reduce batch size, a _decode_on_cpu CPU fallback, tiled-decode OOM cascades, and expanded VRAM logging/warnings.
  • vLLM / KV-budget safety (acestep/third_parts/nano-vllm/.../model_runner.py): Add a MAX_CUDA_VRAM simulation hook, enforce a minimum of post-KV-cache free VRAM, and emit diagnostics/warnings when post-allocation free VRAM < 1GB.
  • Profiling & Tier Testing (profile_inference.py): New tier-test mode and CLI flags (--tiers, --tier-with-lm, --tier-skip-compile, --tier-boundary, --tier-batch-boundary): automates per-tier simulation, model selection, short inference tests, boundary/batch analysis, and summaries.
  • VRAM Profiling Utility (scripts/profile_vram.py): New script for component-level VRAM profiling (DiT, VAE, text encoder, LM) with memory stats, OOM-safe measurement loops, and optional JSON output to aid VRAM calibration.
  • Docs & Guides (EN/JA/KO/ZH) + README (README.md, docs/**): Replace the simple GPU table with richer backend-aware tier tables and Adaptive UI Defaults; document the VRAM guard, adaptive VAE decode, auto chunk sizing, MAX_CUDA_VRAM debug usage, and tier-test/boundary testing; update install/guide text and examples.
  • UI wiring (acestep/gradio_ui/events/__init__.py): Extend init_btn outputs to include audio_duration and batch_size_input updates after initialization.

Sequence Diagram(s)

sequenceDiagram
    participant UI as Gradio UI
    participant API as API Server / Pipeline
    participant GPU as gpu_config
    participant Disk as Disk models
    participant LM as LM Loader
    participant VAE as VAE/DiT runtime

    UI->>API: init request (init_params)
    API->>GPU: probe get_gpu_memory_gb(), get_gpu_tier()
    GPU-->>API: GPUConfig (recommended_lm, backend, offload, quantization, limits)
    API->>Disk: find_best_lm_model_on_disk(recommended_lm)
    Disk-->>API: chosen_model or none
    API->>LM: attempt LM init (selected model, backend)
    alt LM too large or backend restricted
        LM-->>API: fail / downgrade -> API disables LM or selects smaller model
    end
    API->>UI: return tier-derived UI updates (duration, batch, warnings)
    UI->>API: start generation
    API->>GPU: estimate_inference_vram()
    alt estimated > available
        API->>API: _vram_guard_reduce_batch -> adjust batch/duration
        API->>VAE: use adaptive chunk size or perform _decode_on_cpu on OOM
    end
    API-->>UI: generation results or OOM diagnostics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • ChuxiJ

Poem

🐰
I hopped through tiers of VRAM bright,
I nudged the UI, tuned models right.
When memory's thin I gently shrink,
I stitch fallbacks so things won't sink.
Hooray — safe runs, with carrots in sight. 🥕

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: introducing a GPU compatibility tier system with boundary testing capabilities across the codebase.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 80.88%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/gpu-compatibility-tier-boundary-testing

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py (1)

247-260: ⚠️ Potential issue | 🟠 Major

Reserve can be bypassed when available_for_kv_cache <= 0.
With MAX_CUDA_VRAM simulation or high current usage, the fallback (free * 0.5) plus max(1, …) still allocates KV cache, defeating the 1 GB reserve and risking immediate OOM. Consider clamping to zero and letting the existing guard raise when there’s no headroom.

🛠️ Proposed fix to respect the reserve and fail fast
-        if available_for_kv_cache <= 0:
-            available_for_kv_cache = free * 0.5  # Fallback to 50% of free memory
-
-        config.num_kvcache_blocks = max(1, int(available_for_kv_cache) // block_bytes)
-        if config.num_kvcache_blocks <= 0:
+        if available_for_kv_cache <= 0:
+            available_for_kv_cache = 0
+
+        config.num_kvcache_blocks = int(available_for_kv_cache) // block_bytes
+        if config.num_kvcache_blocks <= 0:
             raise RuntimeError(
                 f"Insufficient GPU memory for KV cache. "
                 f"Free: {free / 1024**3:.2f} GB, Current: {current / 1024**3:.2f} GB, "
                 f"Available for KV: {available_for_kv_cache / 1024**3:.2f} GB, "
                 f"Block size: {block_bytes / 1024**2:.2f} MB"
             )
🤖 Fix all issues with AI agents
In `@acestep/handler.py`:
- Around line 1608-1617: The VRAM guard in _vram_guard_reduce_batch is checking
self.config_path which initialize_service never sets, so base-model detection
never triggers; update the check to use the existing config object (e.g.,
self.config) instead—inspect self.config.is_turbo or other fields on self.config
to determine base vs turbo and multiply per_sample_gb by 2.0 when appropriate;
ensure this logic is applied where per_sample_gb is computed in
_vram_guard_reduce_batch and remove or stop relying on self.config_path, or set
self.config_path during initialize_service if you prefer that pattern.
- Around line 3709-3718: The VRAM auto-check erroneously runs on non-CUDA
backends (MPS/XPU) because get_effective_free_vram_gb() returns 0 when
torch.cuda.is_available() is false, forcing VAE decode to CPU; change the logic
in the generate_music VAE decision block to only call
get_effective_free_vram_gb() and apply the _effective_free < 0.5 gate when
torch.cuda.is_available() is true (i.e., wrap the effective-free-VRAM check in a
cuda-available conditional), while preserving the ACESTEP_VAE_ON_CPU env
override and the _vae_cpu variable behavior so only CUDA devices can auto-enable
CPU VAE decode.

In `@acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py`:
- Around line 269-282: The f-string log in model_runner.py uses a Unicode
multiplication sign (×) which triggers RUF001 and can cause copy/paste/terminal
issues; update the print statement that formats KV cache info (the one
referencing config.num_kvcache_blocks, self.block_size, max_tokens_capacity,
kv_cache_size_gb, free, current, target_total_usage, block_bytes, post_kv_free)
to replace the Unicode "×" with a plain ASCII "x" character so the message
becomes e.g. "{config.num_kvcache_blocks} blocks x {self.block_size} tokens =
..." while keeping the rest of the formatting unchanged.

In `@docs/en/ace_step_musicians_guide.md`:
- Around line 157-160: Update the enthusiast tier entry so its batch-size range
follows the tier progression: locate the line containing "16-20 GB (enthusiast)"
and the phrase "1-4 songs at a time" and change it to "2-4 songs at a time"
(keeping the rest of the text, e.g., "Songs up to 10 minutes" and "Larger
Songwriter brain (1.7B)" unchanged) so the lower bound is consistent with the
8-12GB and 12-16GB tiers.

In `@docs/en/BENCHMARK.md`:
- Around line 160-223: The sample output code fence under the "tier-test"
section (the "TIER TEST RESULTS" block) lacks a language tag; update the opening
fence from ``` to ```text to satisfy MD040 and Markdown linting, leaving the
fence contents and closing ``` unchanged so the block is explicitly marked as
plain text.

In `@docs/zh/BENCHMARK.md`:
- Around line 187-199: The fenced code block that begins with "TIER TEST
RESULTS" is missing a language specifier; update the opening fence from ``` to
```text (or another appropriate spec like ```console) so syntax highlighters and
accessibility tools recognize it—modify the code block in the
docs/zh/BENCHMARK.md content around the "TIER TEST RESULTS" section to include
the language tag on the opening backticks.

In `@profile_inference.py`:
- Line 1072: The print call using an unnecessary f-string should be changed to a
regular string: locate the statement print(f"\n  --- Variant: default ---") in
profile_inference.py and remove the leading f so it becomes print("\n  ---
Variant: default ---"); no other behavior changes are needed.
- Around line 1092-1109: Fix the unnecessary f-string prefixes on print
statements that have no interpolations: replace print(f"...") with print("...")
for the messages around the "no-quant" and "no-offload" variants in the block
that calls _run_single_tier_test; specifically update the print calls that
reference the no-quant and no-offload messages which use
gpu_config.quantization_default and gpu_config.offload_to_cpu_default to
determine skips so they are ordinary string literals instead of f-strings.
- Around line 850-853: Remove the now-unnecessary noqa by deleting the "# noqa:
F401" comment on the import flash_attn line in the try block (the import and the
subsequent use of use_flash_attention already satisfy linter rules), i.e.,
update the import flash_attn statement so it no longer includes the noqa
directive.
- Around line 1276-1286: The two print calls that use f-strings without
placeholders should be regular strings: in the block that references
capability_name, failing, and passing (the one that prints the boundary summary
and returns passing[0] if any), replace print(f"    ❌ No tier passed this test.
All tested tiers failed.") and print(f"    ⚠️ No test results available for this
capability.") with print("    ❌ No tier passed this test. All tested tiers
failed.") and print("    ⚠️ No test results available for this capability.")
respectively so the unnecessary f-string prefixes are removed.

In `@scripts/profile_vram.py`:
- Around line 318-321: The code builds encoder_path =
os.path.join(checkpoint_dir, "text_encoder") which doesn't match the runtime
checkpoint name (e.g., "Qwen3-Embedding-0.6B"), so profiling can skip the
encoder; update the logic in scripts/profile_vram.py around encoder_path to try
the runtime checkpoint name as a fallback (check for
os.path.exists(os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")) if the
"text_encoder" path is missing) and only return {} after both attempts fail, or
prefer the runtime-named directory when present; ensure references to
encoder_path, checkpoint_dir and the literal names ("text_encoder",
"Qwen3-Embedding-0.6B") are used so the handler and this script align.
- Around line 165-183: The current DiT profiling only allocates and deletes
dummy tensors (noise, text_hidden, text_mask) and never executes the model, so
peak memory misses activation usage; replace the no-op block with a minimal
forward pass by calling the DiT model (e.g., model(noise, text_hidden,
text_mask) or model.forward(...)) inside the torch.inference_mode() context so
activations are allocated and measured, and when has_cfg is true duplicate the
inputs (noise_cfg, text_hidden_cfg, text_mask_cfg) and pass the doubled batch to
the model to simulate classifier-free guidance; alternatively, if you
intentionally only want to measure input allocation, rename peak_inference_gb to
peak_input_allocation_gb to reflect the narrower measurement.
🧹 Nitpick comments (7)
acestep/acestep_v15_pipeline.py (1)

212-223: String replacement for model downgrade is brittle.

The model path downgrade using replace("4B", "1.7B") assumes a specific naming pattern. If a model path contains "4B" elsewhere (e.g., in a directory name or version suffix), this could produce unexpected results.

Consider using a more robust approach that validates the replacement actually targets the model size portion of the path:

🛡️ Suggested improvement
     if args.lm_model_path and 0 < gpu_memory_gb < VRAM_AUTO_OFFLOAD_THRESHOLD_GB:
         if "4B" in args.lm_model_path:
-            # Downgrade to 1.7B if available
-            fallback = args.lm_model_path.replace("4B", "1.7B")
+            # Downgrade to 1.7B if available - only replace in model name portion
+            import re
+            # Match "4B" that appears to be a model size (preceded by - or lm-)
+            fallback = re.sub(r'(lm-|-)4B\b', r'\g<1>1.7B', args.lm_model_path)
+            if fallback == args.lm_model_path:
+                # Fallback didn't change anything meaningful, skip downgrade warning
+                fallback = None
acestep/gradio_ui/events/generation_handlers.py (1)

517-548: Minor: Duplicate get_global_gpu_config() call.

get_global_gpu_config() is called at line 450 and again at line 518. Since the GPU config is a singleton that doesn't change during initialization, you could reuse the earlier reference.

This is a very minor optimization and doesn't affect correctness.

docs/zh/GPU_COMPATIBILITY.md (1)

141-149: Add language specifier to fenced code block.

The code block showing the boundary analysis output is missing a language specifier. Since this is plain text output, use text or plaintext to satisfy the markdown linter.

📝 Proposed fix
-```
+```text
 BOUNDARY ANALYSIS
 =================
acestep/gradio_ui/interfaces/generation.py (1)

174-181: Consider disabling the LM checkbox for unsupported tiers.

The info text warns that LM is unavailable for low-VRAM tiers, but the checkbox remains interactive. Consider setting interactive=False when gpu_config.available_lm_models is empty to prevent users from enabling a non-functional feature.

♻️ Proposed enhancement
+                lm_interactive = bool(gpu_config.available_lm_models)
                 init_llm_checkbox = gr.Checkbox(
                     label=t("service.init_llm_label"),
                     value=init_llm_value,
                     info=lm_info_text,
+                    interactive=lm_interactive,
                 )
profile_inference.py (1)

1126-1152: Consider logging exceptions during handler cleanup instead of silent pass.

The try-except-pass pattern silently swallows all exceptions during cleanup. While cleanup should be resilient, logging at DEBUG level helps diagnose issues during development without cluttering normal output.

♻️ Proposed enhancement
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _cleanup_handlers(dit_handler, llm_handler):
     """Clean up handlers and free GPU memory."""
     try:
         if dit_handler is not None:
             if hasattr(dit_handler, 'model') and dit_handler.model is not None:
                 dit_handler.model = None
             if hasattr(dit_handler, 'vae') and dit_handler.vae is not None:
                 dit_handler.vae = None
             if hasattr(dit_handler, 'text_encoder') and dit_handler.text_encoder is not None:
                 dit_handler.text_encoder = None
             del dit_handler
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug("DiT handler cleanup error (non-fatal): %s", e)

     try:
         if llm_handler is not None:
             if hasattr(llm_handler, 'llm') and llm_handler.llm is not None:
                 llm_handler.llm = None
             del llm_handler
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug("LLM handler cleanup error (non-fatal): %s", e)
acestep/handler.py (1)

1581-1586: Remove or use unused use_lm parameter.

use_lm is unused and triggers lint warnings. Either wire it into the estimate (LM overhead) or drop it from the signature.

🛠️ Proposed fix (remove if unused)
-        audio_duration: Optional[float] = None,
-        use_lm: bool = False,
+        audio_duration: Optional[float] = None,
acestep/gpu_config.py (1)

792-805: Ensure adaptive recommended LM is actually available.

compute_adaptive_config picks recommended_lm_model from tier defaults even when the VRAM-budgeted available_lm_models list is smaller. That can recommend a model that doesn’t fit the computed budget. Consider clamping to the largest available model when the tier default isn’t in available_lm_models.

🛠️ Proposed fix
-    return GPUConfig(
+    recommended_model = tier_config.get("recommended_lm_model", "")
+    if recommended_model not in available_lm_models:
+        recommended_model = available_lm_models[-1] if available_lm_models else ""
+    return GPUConfig(
         tier=tier,
         gpu_memory_gb=total_vram_gb,
         max_duration_with_lm=max_dur_lm,
         max_duration_without_lm=max_dur_no_lm,
         max_batch_size_with_lm=max_batch_with_lm,
         max_batch_size_without_lm=max_batch_no_lm,
         init_lm_default=bool(available_lm_models),
         available_lm_models=available_lm_models,
-        recommended_lm_model=tier_config.get("recommended_lm_model", available_lm_models[0] if available_lm_models else ""),
+        recommended_lm_model=recommended_model,
         lm_backend_restriction=tier_config.get("lm_backend_restriction", "all"),
         recommended_backend=tier_config.get("recommended_backend", "vllm"),
         offload_to_cpu_default=tier_config.get("offload_to_cpu_default", True),
         offload_dit_to_cpu_default=tier_config.get("offload_dit_to_cpu_default", True),
         quantization_default=tier_config.get("quantization_default", True),
         compile_model_default=tier_config.get("compile_model_default", True),
         lm_memory_gb=lm_memory_gb,
     )

Comment on lines +1608 to +1617
        # Estimate per-sample activation cost for DiT
        duration_sec = float(audio_duration) if audio_duration and float(audio_duration) > 0 else 60.0
        # Empirical: ~0.8 GB per sample at 60s, linear scaling
        per_sample_gb = 0.8 * (duration_sec / 60.0)
        # If using cfg (base model), double the per-sample cost
        if hasattr(self, 'model') and self.model is not None:
            model_name = getattr(self, 'config_path', '') or ''
            if 'base' in model_name.lower():
                per_sample_gb *= 2.0

⚠️ Potential issue | 🟠 Major

Base-model detection in VRAM guard never triggers.

_vram_guard_reduce_batch checks self.config_path, but initialize_service never sets it. That means base models won’t double the per-sample estimate, so the guard can allow oversized batches and still OOM. Consider using self.config.is_turbo (or storing config_path during init) instead.

🛠️ Proposed fix (use config instead of config_path)
-        if hasattr(self, 'model') and self.model is not None:
-            model_name = getattr(self, 'config_path', '') or ''
-            if 'base' in model_name.lower():
-                per_sample_gb *= 2.0
+        if self.model is not None and self.config is not None:
+            if not getattr(self.config, "is_turbo", False):
+                per_sample_gb *= 2.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Estimate per-sample activation cost for DiT
duration_sec = float(audio_duration) if audio_duration and float(audio_duration) > 0 else 60.0
# Empirical: ~0.8 GB per sample at 60s, linear scaling
per_sample_gb = 0.8 * (duration_sec / 60.0)
# If using cfg (base model), double the per-sample cost
if hasattr(self, 'model') and self.model is not None:
model_name = getattr(self, 'config_path', '') or ''
if 'base' in model_name.lower():
per_sample_gb *= 2.0
# Estimate per-sample activation cost for DiT
duration_sec = float(audio_duration) if audio_duration and float(audio_duration) > 0 else 60.0
# Empirical: ~0.8 GB per sample at 60s, linear scaling
per_sample_gb = 0.8 * (duration_sec / 60.0)
# If using cfg (base model), double the per-sample cost
if self.model is not None and self.config is not None:
if not getattr(self.config, "is_turbo", False):
per_sample_gb *= 2.0
🤖 Prompt for AI Agents
In `@acestep/handler.py` around lines 1608 - 1617, The VRAM guard in
_vram_guard_reduce_batch is checking self.config_path which initialize_service
never sets, so base-model detection never triggers; update the check to use the
existing config object (e.g., self.config) instead—inspect self.config.is_turbo
or other fields on self.config to determine base vs turbo and multiply
per_sample_gb by 2.0 when appropriate; ensure this logic is applied where
per_sample_gb is computed in _vram_guard_reduce_batch and remove or stop relying
on self.config_path, or set self.config_path during initialize_service if you
prefer that pattern.

Comment on lines +3709 to +3718
                    # Check effective free VRAM and auto-enable CPU decode if extremely tight
                    import os as _os
                    _vae_cpu = _os.environ.get("ACESTEP_VAE_ON_CPU", "0").lower() in ("1", "true", "yes")
                    if not _vae_cpu:
                        _effective_free = get_effective_free_vram_gb()
                        logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
                        # If less than 0.5 GB free, VAE decode on GPU will almost certainly OOM
                        if _effective_free < 0.5:
                            logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
                            _vae_cpu = True

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the get_effective_free_vram_gb function definition
rg "def get_effective_free_vram_gb" --type py -A 15

Repository: ace-step/ACE-Step-1.5

Length of output: 1003


🏁 Script executed:

# Also search for where this function is imported or used
rg "get_effective_free_vram_gb" --type py | head -20

Repository: ace-step/ACE-Step-1.5

Length of output: 733


🏁 Script executed:

# Check if there are device-specific memory utilities (e.g., for MPS)
rg "mps|MPS" acestep/handler.py -B 2 -A 2 | head -50

Repository: ace-step/ACE-Step-1.5

Length of output: 3029


🏁 Script executed:

# Inspect the context around lines 3709-3718 to understand self.device
sed -n '3680,3730p' acestep/handler.py

Repository: ace-step/ACE-Step-1.5

Length of output: 3186


🏁 Script executed:

# Get the full implementation of get_effective_free_vram_gb
rg "def get_effective_free_vram_gb" --type py -A 50

Repository: ace-step/ACE-Step-1.5

Length of output: 3404


🏁 Script executed:

# Also check if there are any MPS memory utilities in the codebase
rg "torch.backends.mps|torch.mps" --type py | grep -i memory

Repository: ace-step/ACE-Step-1.5

Length of output: 257


Gate VRAM check to CUDA devices only—otherwise MPS/XPU are forced to CPU decode.

get_effective_free_vram_gb() immediately returns 0 when torch.cuda.is_available() is False, so on MPS and XPU devices, _effective_free < 0.5 always evaluates true and unconditionally forces VAE decode to CPU. This degrades performance on systems with only MPS or XPU acceleration.

Gate the memory check to CUDA devices since the function is CUDA-specific:

🛠️ Proposed fix (gate by CUDA)
-                    if not _vae_cpu:
-                        _effective_free = get_effective_free_vram_gb()
-                        logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
-                        # If less than 0.5 GB free, VAE decode on GPU will almost certainly OOM
-                        if _effective_free < 0.5:
-                            logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
-                            _vae_cpu = True
+                    is_cuda = self.device == "cuda" or (isinstance(self.device, str) and self.device.startswith("cuda"))
+                    if not _vae_cpu and is_cuda:
+                        _effective_free = get_effective_free_vram_gb()
+                        logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
+                        if _effective_free < 0.5:
+                            logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
+                            _vae_cpu = True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Check effective free VRAM and auto-enable CPU decode if extremely tight
import os as _os
_vae_cpu = _os.environ.get("ACESTEP_VAE_ON_CPU", "0").lower() in ("1", "true", "yes")
if not _vae_cpu:
_effective_free = get_effective_free_vram_gb()
logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
# If less than 0.5 GB free, VAE decode on GPU will almost certainly OOM
if _effective_free < 0.5:
logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
_vae_cpu = True
# Check effective free VRAM and auto-enable CPU decode if extremely tight
import os as _os
_vae_cpu = _os.environ.get("ACESTEP_VAE_ON_CPU", "0").lower() in ("1", "true", "yes")
is_cuda = self.device == "cuda" or (isinstance(self.device, str) and self.device.startswith("cuda"))
if not _vae_cpu and is_cuda:
_effective_free = get_effective_free_vram_gb()
logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
if _effective_free < 0.5:
logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
_vae_cpu = True
🤖 Prompt for AI Agents
In `@acestep/handler.py` around lines 3709 - 3718, The VRAM auto-check erroneously
runs on non-CUDA backends (MPS/XPU) because get_effective_free_vram_gb() returns
0 when torch.cuda.is_available() is false, forcing VAE decode to CPU; change the
logic in the generate_music VAE decision block to only call
get_effective_free_vram_gb() and apply the _effective_free < 0.5 gate when
torch.cuda.is_available() is true (i.e., wrap the effective-free-VRAM check in a
cuda-available conditional), while preserving the ACESTEP_VAE_ON_CPU env
override and the _vae_cpu variable behavior so only CUDA devices can auto-enable
CPU VAE decode.

Comment on lines +269 to +282
# If KV cache would leave less than 1 GB free, warn and suggest reducing max_model_len
post_kv_free = (free - config.num_kvcache_blocks * block_bytes) / 1024**3
if post_kv_free < 1.0:
print(
f"[nanovllm] WARNING: After KV cache allocation, only {post_kv_free:.2f} GB free. "
f"DiT inference may OOM. Consider reducing max_model_len or using CPU offload."
)

print(
f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks × {self.block_size} tokens = "
f"{max_tokens_capacity} tokens capacity, {kv_cache_size_gb:.2f} GB "
f"(free: {free / 1024**3:.2f} GB, used: {current / 1024**3:.2f} GB, "
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB)"
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB, "
f"post_kv_free: {post_kv_free:.2f} GB)"

⚠️ Potential issue | 🟡 Minor

Replace the Unicode multiplication sign in the log line.
It triggers RUF001 and can cause copy/paste issues in terminals—use plain x.

✏️ Suggested tweak
-            f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks × {self.block_size} tokens = "
+            f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks x {self.block_size} tokens = "
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# If KV cache would leave less than 1 GB free, warn and suggest reducing max_model_len
post_kv_free = (free - config.num_kvcache_blocks * block_bytes) / 1024**3
if post_kv_free < 1.0:
print(
f"[nanovllm] WARNING: After KV cache allocation, only {post_kv_free:.2f} GB free. "
f"DiT inference may OOM. Consider reducing max_model_len or using CPU offload."
)
print(
f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks × {self.block_size} tokens = "
f"{max_tokens_capacity} tokens capacity, {kv_cache_size_gb:.2f} GB "
f"(free: {free / 1024**3:.2f} GB, used: {current / 1024**3:.2f} GB, "
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB)"
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB, "
f"post_kv_free: {post_kv_free:.2f} GB)"
# If KV cache would leave less than 1 GB free, warn and suggest reducing max_model_len
post_kv_free = (free - config.num_kvcache_blocks * block_bytes) / 1024**3
if post_kv_free < 1.0:
print(
f"[nanovllm] WARNING: After KV cache allocation, only {post_kv_free:.2f} GB free. "
f"DiT inference may OOM. Consider reducing max_model_len or using CPU offload."
)
print(
f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks x {self.block_size} tokens = "
f"{max_tokens_capacity} tokens capacity, {kv_cache_size_gb:.2f} GB "
f"(free: {free / 1024**3:.2f} GB, used: {current / 1024**3:.2f} GB, "
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB, "
f"post_kv_free: {post_kv_free:.2f} GB)"
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 278-278: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)

🤖 Prompt for AI Agents
In `@acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py` around lines
269 - 282, The f-string log in model_runner.py uses a Unicode multiplication
sign (×) which triggers RUF001 and can cause copy/paste/terminal issues; update
the print statement that formats KV cache info (the one referencing
config.num_kvcache_blocks, self.block_size, max_tokens_capacity,
kv_cache_size_gb, free, current, target_total_usage, block_bytes, post_kv_free)
to replace the Unicode "×" with a plain ASCII "x" character so the message
becomes e.g. "{config.num_kvcache_blocks} blocks x {self.block_size} tokens =
..." while keeping the rest of the formatting unchanged.

Comment on lines +157 to +160
16-20 GB (enthusiast) Songs up to 10 minutes
▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 1-4 songs at a time
Larger Songwriter brain (1.7B)

⚠️ Potential issue | 🟡 Minor

Batch size lower bound inconsistent with tier progression.

The enthusiast tier (16-20 GB) shows "1-4 songs at a time", but lower tiers (mainstream at 8-12GB and sweet spot at 12-16GB) already support "2-4 songs at a time". The lower bound of 1 for a higher tier doesn't follow a logical progression.

📝 Suggested correction
-    16-20 GB (enthusiast)    Songs up to 10 minutes
-    ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░    1-4 songs at a time
+    16-20 GB (enthusiast)    Songs up to 10 minutes
+    ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░    2-4 songs at a time
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
16-20 GB (enthusiast) Songs up to 10 minutes
▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 1-4 songs at a time
Larger Songwriter brain (1.7B)
16-20 GB (enthusiast) Songs up to 10 minutes
▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 2-4 songs at a time
Larger Songwriter brain (1.7B)
🤖 Prompt for AI Agents
In `@docs/en/ace_step_musicians_guide.md` around lines 157 - 160, Update the
enthusiast tier entry so its batch-size range follows the tier progression:
locate the line containing "16-20 GB (enthusiast)" and the phrase "1-4 songs at
a time" and change it to "2-4 songs at a time" (keeping the rest of the text,
e.g., "Songs up to 10 minutes" and "Larger Songwriter brain (1.7B)" unchanged)
so the lower bound is consistent with the 8-12GB and 12-16GB tiers.

Comment on lines +160 to +223
### 6. `tier-test` — Automated GPU Tier Testing

Automatically simulates different GPU VRAM sizes using `MAX_CUDA_VRAM` and runs a generation test at each tier. This is the recommended way to validate that all GPU tiers work correctly after modifying `acestep/gpu_config.py`.

```bash
# Test all tiers (4, 6, 8, 12, 16, 20, 24 GB)
python profile_inference.py --mode tier-test

# Test specific VRAM sizes
python profile_inference.py --mode tier-test --tiers 6 8 16

# Test with LM enabled (where the tier supports it)
python profile_inference.py --mode tier-test --tier-with-lm

# Quick test: skip torch.compile for non-quantized tiers
python profile_inference.py --mode tier-test --tier-skip-compile
```

**What it validates per tier:**
- Correct tier detection and `GPUConfig` construction
- Model initialization (DiT, VAE, Text Encoder, optionally LM)
- A short generation run (30s duration, batch=1) completes without OOM
- Adaptive VAE decode fallback (GPU → CPU offload → full CPU)
- VRAM usage stays within the simulated limit

**Output example:**

```
TIER TEST RESULTS
====================================================================================================
VRAM Tier LM Duration Status Peak VRAM Notes
──────────────────────────────────────────────────────────────────────────────
4GB tier1 — 30s ✅ OK 3.8GB VAE decoded on CPU
6GB tier2 — 30s ✅ OK 5.4GB Tiled VAE chunk=256
8GB tier4 0.6B 30s ✅ OK 7.2GB vllm backend
12GB tier5 1.7B 30s ✅ OK 10.8GB vllm backend
16GB tier6a 1.7B 30s ✅ OK 14.5GB offload enabled
20GB tier6b 1.7B 30s ✅ OK 17.2GB no offload
24GB unlimited 4B 30s ✅ OK 21.3GB full models on GPU
```

> **Note**: `tier-test` mode uses `torch.cuda.set_per_process_memory_fraction()` to enforce a hard VRAM cap, making simulations realistic even on high-end GPUs (e.g., A100 80GB).

#### Boundary Testing

Use `--tier-boundary` to find the minimum VRAM tier at which INT8 quantization and CPU offload can be safely disabled. For each tier, up to three configurations are tested:

1. **default** — tier's standard settings
2. **no-quant** — quantization disabled, offload unchanged
3. **no-offload** — no quantization AND no CPU offload

```bash
# Run boundary tests across all tiers
python profile_inference.py --mode tier-test --tier-boundary

# Boundary test with LM enabled
python profile_inference.py --mode tier-test --tier-boundary --tier-with-lm

# Save boundary results to JSON
python profile_inference.py --mode tier-test --tier-boundary --benchmark-output boundary_results.json
```

The output includes a **Boundary Analysis** summary showing the minimum tier for each capability.

⚠️ Potential issue | 🟡 Minor

Add a language to the tier-test output code fence.
This fixes MD040 and keeps Markdown lint clean.

✏️ Suggested fix
-```
+```text
 TIER TEST RESULTS
 ====================================================================================================
 ...
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 187-187: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/en/BENCHMARK.md` around lines 160 - 223, The sample output code fence
under the "tier-test" section (the "TIER TEST RESULTS" block) lacks a language
tag; update the opening fence from ``` to ```text to satisfy MD040 and Markdown
linting, leaving the fence contents and closing ``` unchanged so the block is
explicitly marked as plain text.

print(f" max_batch_without_lm: {gpu_config.max_batch_size_without_lm}")

# ---- Test 1: Default configuration ----
print(f"\n --- Variant: default ---")

⚠️ Potential issue | 🟡 Minor

Remove extraneous f-string prefix.

This string has no placeholders but uses an f-string prefix.

🔧 Proposed fix
-        print(f"\n  --- Variant: default ---")
+        print("\n  --- Variant: default ---")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"\n --- Variant: default ---")
print("\n --- Variant: default ---")
🧰 Tools
🪛 Ruff (0.14.14)

[error] 1072-1072: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In `@profile_inference.py` at line 1072, The print call using an unnecessary
f-string should be changed to a regular string: locate the statement print(f"\n 
--- Variant: default ---") in profile_inference.py and remove the leading f so
it becomes print("\n  --- Variant: default ---"); no other behavior changes are
needed.

Comment on lines 1092 to 1109
            else:
                print(f"\n  --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")

            # ---- Test 3: No quantization AND no offload ----
            # Skip if the tier already has both disabled
            if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
                print(f"\n  --- Variant: no-offload (quant=None, offload=False) ---")
                result_no_offload = _run_single_tier_test(
                    sim_gb, gpu_config, args, example_data,
                    checkpoint_dir, disk_lm_models,
                    offload_override=False,
                    offload_dit_override=False,
                    quantization_override=None,
                    test_variant="no-offload",
                )
                all_results.append(result_no_offload)
            else:
                print(f"\n  --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")

⚠️ Potential issue | 🟡 Minor

Remove extraneous f-string prefixes.

Several print statements use f-strings without any placeholders.

🔧 Proposed fix
             else:
-                print(f"\n  --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")
+                print("\n  --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")

             # ---- Test 3: No quantization AND no offload ----
             # Skip if the tier already has both disabled
             if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
-                print(f"\n  --- Variant: no-offload (quant=None, offload=False) ---")
+                print("\n  --- Variant: no-offload (quant=None, offload=False) ---")
                 result_no_offload = _run_single_tier_test(
                     ...
                 )
                 all_results.append(result_no_offload)
             else:
-                print(f"\n  --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
+                print("\n  --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
else:
print(f"\n --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")
# ---- Test 3: No quantization AND no offload ----
# Skip if the tier already has both disabled
if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
print(f"\n --- Variant: no-offload (quant=None, offload=False) ---")
result_no_offload = _run_single_tier_test(
sim_gb, gpu_config, args, example_data,
checkpoint_dir, disk_lm_models,
offload_override=False,
offload_dit_override=False,
quantization_override=None,
test_variant="no-offload",
)
all_results.append(result_no_offload)
else:
print(f"\n --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
else:
print("\n --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")
# ---- Test 3: No quantization AND no offload ----
# Skip if the tier already has both disabled
if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
print("\n --- Variant: no-offload (quant=None, offload=False) ---")
result_no_offload = _run_single_tier_test(
sim_gb, gpu_config, args, example_data,
checkpoint_dir, disk_lm_models,
offload_override=False,
offload_dit_override=False,
quantization_override=None,
test_variant="no-offload",
)
all_results.append(result_no_offload)
else:
print("\n --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
🧰 Tools
🪛 Ruff (0.14.14)

[error] 1093-1093: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 1098-1098: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 1109-1109: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In `@profile_inference.py` around lines 1092 - 1109, Fix the unnecessary f-string
prefixes on print statements that have no interpolations: replace print(f"...")
with print("...") for the messages around the "no-quant" and "no-offload"
variants in the block that calls _run_single_tier_test; specifically update the
print calls that reference the no-quant and no-offload messages which use
gpu_config.quantization_default and gpu_config.offload_to_cpu_default to
determine skips so they are ordinary string literals instead of f-strings.

Comment on lines +1276 to +1286
            if failing:
                print(f"  {capability_name}:")
                print(f"    ❌ No tier passed this test. All tested tiers failed.")
                for r in failing:
                    err = (r.get("error") or "unknown")[:50]
                    print(f"       {r['tier_gb']}GB ({r['tier']}): {err}")
            else:
                print(f"  {capability_name}:")
                print(f"    ⚠️ No test results available for this capability.")
            print()
            return passing[0] if passing else None

⚠️ Potential issue | 🟡 Minor

Remove extraneous f-string prefixes in boundary summary.

Two print statements use f-strings without placeholders.

🔧 Proposed fix
             if failing:
-                print(f"  {capability_name}:")
-                print(f"    ❌ No tier passed this test. All tested tiers failed.")
+                print(f"  {capability_name}:")  # This one is fine, has placeholder
+                print("    ❌ No tier passed this test. All tested tiers failed.")
                 for r in failing:
                     err = (r.get("error") or "unknown")[:50]
                     print(f"       {r['tier_gb']}GB ({r['tier']}): {err}")
             else:
                 print(f"  {capability_name}:")
-                print(f"    ⚠️ No test results available for this capability.")
+                print("    ⚠️ No test results available for this capability.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if failing:
print(f" {capability_name}:")
print(f" ❌ No tier passed this test. All tested tiers failed.")
for r in failing:
err = (r.get("error") or "unknown")[:50]
print(f" {r['tier_gb']}GB ({r['tier']}): {err}")
else:
print(f" {capability_name}:")
print(f" ⚠️ No test results available for this capability.")
print()
return passing[0] if passing else None
if failing:
print(f" {capability_name}:") # This one is fine, has placeholder
print(" ❌ No tier passed this test. All tested tiers failed.")
for r in failing:
err = (r.get("error") or "unknown")[:50]
print(f" {r['tier_gb']}GB ({r['tier']}): {err}")
else:
print(f" {capability_name}:")
print(" ⚠️ No test results available for this capability.")
print()
return passing[0] if passing else None
🧰 Tools
🪛 Ruff (0.14.14)

[error] 1278-1278: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 1284-1284: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In `@profile_inference.py` around lines 1276 - 1286, The two print calls that use
f-strings without placeholders should be regular strings: in the block that
references capability_name, failing, and passing (the one that prints the
boundary summary and returns passing[0] if any), replace print(f"    ❌ No tier
passed this test. All tested tiers failed.") and print(f"    ⚠️ No test results
available for this capability.") with print("    ❌ No tier passed this test. All
tested tiers failed.") and print("    ⚠️ No test results available for this
capability.") respectively so the unnecessary f-string prefixes are removed.

Comment on lines +165 to +183
    try:
        with torch.inference_mode():
            # Simulate DiT inference inputs
            # Create dummy latent noise
            noise = torch.randn(batch_size, latent_length, latent_dim, device="cuda", dtype=torch.bfloat16)

            # Simulate text encoder output
            text_hidden = torch.randn(batch_size, 512, 768, device="cuda", dtype=torch.bfloat16)
            text_mask = torch.ones(batch_size, 512, device="cuda", dtype=torch.long)

            # If has CFG, double the batch for classifier-free guidance
            if has_cfg:
                noise_cfg = torch.cat([noise, noise], dim=0)
                text_hidden_cfg = torch.cat([text_hidden, text_hidden], dim=0)
                text_mask_cfg = torch.cat([text_mask, text_mask], dim=0)
                del noise_cfg, text_hidden_cfg, text_mask_cfg

            del noise, text_hidden, text_mask
            torch.cuda.synchronize()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file and understand its context
wc -l scripts/profile_vram.py

Repository: ace-step/ACE-Step-1.5

Length of output: 92


🏁 Script executed:

# Let's read the file to understand the full context around lines 165-183
cat -n scripts/profile_vram.py | sed -n '150,200p'

Repository: ace-step/ACE-Step-1.5

Length of output: 2780


🏁 Script executed:

# Let's also see the beginning of the function that contains this code
cat -n scripts/profile_vram.py | sed -n '1,50p'

Repository: ace-step/ACE-Step-1.5

Length of output: 2044


🏁 Script executed:

# Let's see the function that contains this code - find where it starts
cat -n scripts/profile_vram.py | sed -n '100,165p'

Repository: ace-step/ACE-Step-1.5

Length of output: 2778


🏁 Script executed:

# Search for model.forward or model( calls in the file
rg -n "model\.|\.forward\(|\.generate\(" scripts/profile_vram.py

Repository: ace-step/ACE-Step-1.5

Length of output: 347


🏁 Script executed:

# Let's see the complete try-except block to verify if model is called anywhere
cat -n scripts/profile_vram.py | sed -n '165,210p'

Repository: ace-step/ACE-Step-1.5

Length of output: 2706


DiT "inference" profiling doesn't run the model.

The loop only allocates dummy tensors (lines 169–182) and deletes them; it never executes a forward pass. The peak memory measurement (line 186) captures only input tensor allocation, excluding all activation memory from the actual model computation. This will severely under-report peak VRAM usage and mis-calibrate GPU tier selection. Either run a minimal forward pass with model(noise, text_hidden, text_mask) to include activations, or rename peak_inference_gb to peak_input_allocation_gb to clarify the scope.

🤖 Prompt for AI Agents
In `@scripts/profile_vram.py` around lines 165 - 183, The current DiT profiling
only allocates and deletes dummy tensors (noise, text_hidden, text_mask) and
never executes the model, so peak memory misses activation usage; replace the
no-op block with a minimal forward pass by calling the DiT model (e.g.,
model(noise, text_hidden, text_mask) or model.forward(...)) inside the
torch.inference_mode() context so activations are allocated and measured, and
when has_cfg is true duplicate the inputs (noise_cfg, text_hidden_cfg,
text_mask_cfg) and pass the doubled batch to the model to simulate
classifier-free guidance; alternatively, if you intentionally only want to
measure input allocation, rename peak_inference_gb to peak_input_allocation_gb
to reflect the narrower measurement.

Comment on lines +318 to +321
    encoder_path = os.path.join(checkpoint_dir, "text_encoder")
    if not os.path.exists(encoder_path):
        print(f"  Text encoder not found: {encoder_path}")
        return {}

⚠️ Potential issue | 🟠 Major

Text encoder path doesn’t match runtime checkpoints.

The handler loads the encoder from Qwen3-Embedding-0.6B, but this script looks for text_encoder, so profiling will likely skip it. Align the path or add a fallback.

🛠️ Proposed fix (use runtime path with fallback)
-    encoder_path = os.path.join(checkpoint_dir, "text_encoder")
-    if not os.path.exists(encoder_path):
+    encoder_path = os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")
+    if not os.path.exists(encoder_path):
+        encoder_path = os.path.join(checkpoint_dir, "text_encoder")
+    if not os.path.exists(encoder_path):
         print(f"  Text encoder not found: {encoder_path}")
         return {}
🤖 Prompt for AI Agents
In `@scripts/profile_vram.py` around lines 318 - 321, The code builds encoder_path
= os.path.join(checkpoint_dir, "text_encoder") which doesn't match the runtime
checkpoint name (e.g., "Qwen3-Embedding-0.6B"), so profiling can skip the
encoder; update the logic in scripts/profile_vram.py around encoder_path to try
the runtime checkpoint name as a fallback (check for
os.path.exists(os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")) if the
"text_encoder" path is missing) and only return {} after both attempts fail, or
prefer the runtime-named directory when present; ensure references to
encoder_path, checkpoint_dir and the literal names ("text_encoder",
"Qwen3-Embedding-0.6B") are used so the handler and this script align.

chuxij added 2 commits February 10, 2026 13:06
Root cause: tier6a (16-20GB) had max_batch_size_with_lm=1, which was
overly conservative. Empirical testing on 16GB (simulated) showed:
- Without LM: batch=4 uses 13.3GB, batch=7 uses 13.4GB (all fit in 16GB)
- With LM (1.7B): batch=2 uses 11.9GB, batch=4 fits within 16GB budget

Changes:
- tier6a: max_batch_size_with_lm 1→4, max_batch_size_without_lm 4→8
- tier6b: max_batch_size_with_lm 2→4 (20-24GB has ample headroom)
- Added --tier-batch-boundary flag to profile_inference.py for automated
  batch size boundary testing (escalates 1,2,4,8 with LM and without LM)
- Added GPU tier config patching during batch tests to bypass inference.py
  batch clamping
- Updated GPU_COMPATIBILITY docs (en/zh/ja/ko) and BENCHMARK docs (en/zh)
  with corrected batch limits and new batch boundary testing instructions
- Updated tests to match new batch size expectations
…bled

Root cause: _vram_guard_reduce_batch checks free VRAM *before* DiT runs,
but at that point the vllm LM model (weights + KV cache) is still on GPU.
On a 16GB GPU with 1.7B LM loaded, only ~7.6GB appears free, causing the
guard to slash batch_size from 4 to 1 — even though the LM will be
offloaded before DiT actually needs the memory.

Fix: When offload_to_cpu=True, trust the static GPU tier config limits
(which were empirically validated with offload enabled) instead of the
misleading instantaneous free VRAM reading. If batch_size <= tier's
max_batch_size_with_lm, skip the dynamic VRAM check entirely.

This fixes the bug where users with 16GB GPUs saw LM generate 4 audio
codes but DiT only produced 1 output.
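A rough sketch of the guard behavior this commit describes. gpu_config.max_batch_size_with_lm comes from acestep/gpu_config.py; the function shape and the per_sample_gb estimate are illustrative assumptions, not the actual handler code.

```python
def vram_guard_reduce_batch(batch_size, gpu_config, offload_to_cpu, free_vram_gb,
                            per_sample_gb=2.0):  # per_sample_gb is an assumed estimate
    # With offload enabled, the LM leaves the GPU before DiT runs, so the
    # instantaneous free-VRAM reading is misleading; trust the validated tier limit.
    if offload_to_cpu and batch_size <= gpu_config.max_batch_size_with_lm:
        return batch_size
    # Otherwise fall back to a dynamic clamp based on currently free VRAM.
    affordable = max(1, int(free_vram_gb // per_sample_gb))
    return min(batch_size, affordable)
```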
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@docs/en/GPU_COMPATIBILITY.md`:
- Around line 141-149: The fenced code block showing the "BOUNDARY ANALYSIS"
table lacks a language specifier; update the opening fence that precedes the
BOUNDARY ANALYSIS block from ``` to ```text so the block is explicitly marked as
plain text (look for the lines containing the literal "BOUNDARY ANALYSIS" and
the surrounding triple-backtick fence and change the opening fence accordingly).
- Around line 79-83: Update the tier labels for the GPU simulation examples:
change the comment "Simulate an 8GB GPU (Tier 4)" to "Simulate an 8GB GPU (Tier
3)" and change "Simulate a 12GB GPU (Tier 5)" to "Simulate a 12GB GPU (Tier 4)"
for the lines with MAX_CUDA_VRAM=8 uv run acestep and MAX_CUDA_VRAM=12 uv run
acestep so the examples match the mapping (≤8GB → tier3, ≤12GB → tier4).

In `@docs/ja/GPU_COMPATIBILITY.md`:
- Around line 79-83: Update the comment labels for the GPU simulation examples:
change the "8GB GPU (Tier 4) をシミュレート" comment to "8GB GPU (Tier 3) をシミュレート" and
change the "12GB GPU (Tier 5) をシミュレート" comment to "12GB GPU (Tier 4) をシミュレート" so
the comments match the tier mapping for the MAX_CUDA_VRAM examples (the lines
using MAX_CUDA_VRAM=8 and MAX_CUDA_VRAM=12 before running "uv run acestep").

In `@docs/ko/GPU_COMPATIBILITY.md`:
- Around line 79-83: The tier labels are incorrect for the 8GB/12GB examples;
update the headings for the examples shown (the commented lines above the
commands using MAX_CUDA_VRAM and uv run acestep) so the 8GB example reads "8GB
GPU 시뮬레이션 (티어 3)" and the 12GB example reads "12GB GPU 시뮬레이션 (티어 4)" to match
the mapping ≤8GB → tier3 and ≤12GB → tier4.

In `@docs/zh/BENCHMARK.md`:
- Around line 165-166: Update the "测试所有等级" section so the listed tiers match the
actual default tiers used by profile_inference.py; replace the current list
(which shows 20GB and omits 48GB) with the real defaults (include 48GB and
remove 20GB), or add a short note stating that 20GB is not a default and must be
provided via the --tiers flag; refer to the "python profile_inference.py --mode
tier-test" invocation and the default tier list in the script when making the
change.

In `@docs/zh/GPU_COMPATIBILITY.md`:
- Around line 141-149: The fenced code block that starts with the "BOUNDARY
ANALYSIS" header should include a language specifier to ensure proper rendering;
update the opening triple-backtick for the block containing "BOUNDARY ANALYSIS"
and the table (the block that currently begins with ``` and the header line
"BOUNDARY ANALYSIS") to use a language tag (e.g., change ``` to ```text) so the
table renders as plain text in documentation.
- Around line 79-83: Update the tier labels in the examples: change the heading
"模拟 8GB GPU (Tier 4)" to "模拟 8GB GPU (Tier 3)" and change "模拟 12GB GPU (Tier 5)"
to "模拟 12GB GPU (Tier 4)"; keep the example commands (MAX_CUDA_VRAM=8 uv run
acestep and MAX_CUDA_VRAM=12 uv run acestep) unchanged and ensure the
documentation reflects the mapping ≤8GB → tier3 and ≤12GB → tier4.

In `@profile_inference.py`:
- Around line 789-798: The code may pick a disk-only LM that is too large for
the current tier because disk_lm_models are not filtered by tier size; before
calling find_best_lm_model_on_disk, filter disk_lm_models to only include models
whose size is compatible with the current tier (use the tier variable and
gpu_config-recommended sizing rules) and then pass that filtered list to
find_best_lm_model_on_disk (keep references to lm_model, use_lm,
find_best_lm_model_on_disk, disk_lm_models, gpu_config.recommended_lm_model and
tier so the change is easy to locate).
- Around line 815-823: The current CUDA memory-fraction logic only sets a
reduced fraction when sim_gb < physical VRAM, but doesn't reset the per-process
cap when sim_gb >= physical VRAM, leaving a prior smaller cap in place; update
the block in profile_inference.py that checks torch.cuda.is_available() (the
code using torch.cuda.get_device_properties, total_gb, sim_gb and
torch.cuda.set_per_process_memory_fraction) so that when sim_gb >= total_gb you
explicitly call torch.cuda.set_per_process_memory_fraction(1.0) to clear any
previous cap; retain the existing reduced-fraction calculation path for sim_gb <
total_gb.

Comment on lines 79 to 83
# Simulate an 8GB GPU (Tier 4)
MAX_CUDA_VRAM=8 uv run acestep

# Simulate a 12GB GPU (Tier 5)
MAX_CUDA_VRAM=12 uv run acestep

⚠️ Potential issue | 🟡 Minor

Correct tier labels for 8GB/12GB simulation examples.

Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.

📝 Suggested update
-# Simulate an 8GB GPU (Tier 4)
+# Simulate an 8GB GPU (Tier 3)

-# Simulate a 12GB GPU (Tier 5)
+# Simulate a 12GB GPU (Tier 4)
🤖 Prompt for AI Agents
In `@docs/en/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the tier labels
for the GPU simulation examples: change the comment "Simulate an 8GB GPU (Tier
4)" to "Simulate an 8GB GPU (Tier 3)" and change "Simulate a 12GB GPU (Tier 5)"
to "Simulate a 12GB GPU (Tier 4)" for the lines with MAX_CUDA_VRAM=8 uv run
acestep and MAX_CUDA_VRAM=12 uv run acestep so the examples match the mapping
(≤8GB → tier3, ≤12GB → tier4).
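For context, the mapping the reviewer applies (≤8GB → tier3, ≤12GB → tier4) corresponds to a simple threshold lookup along these lines. This is only a sketch assembled from the tier ranges quoted in this PR; the authoritative logic lives in acestep/gpu_config.py and its exact boundaries may differ.

```python
def detect_tier(vram_gb: float) -> str:
    """Sketch of the VRAM-to-tier selection implied by the review comments."""
    thresholds = [
        (4, "tier1"), (6, "tier2"), (8, "tier3"), (12, "tier4"),
        (16, "tier5"), (20, "tier6a"), (24, "tier6b"),
    ]
    for limit, tier in thresholds:
        if vram_gb <= limit:
            return tier
    return "unlimited"

assert detect_tier(8) == "tier3"   # MAX_CUDA_VRAM=8  → Tier 3
assert detect_tier(12) == "tier4"  # MAX_CUDA_VRAM=12 → Tier 4
```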

Comment on lines +141 to +149
```
BOUNDARY ANALYSIS
=================
  Capability                                    Min Tier   VRAM
  ------------------------------------------------------------
  No INT8 Quantization                          tier6b     20GB
  No CPU Offload (all models on GPU)            tier6b     20GB
  ------------------------------------------------------------
```

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the boundary-analysis output block.

📝 Suggested update
-```
+```text
 BOUNDARY ANALYSIS
 =================
   Capability                                    Min Tier   VRAM
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 141-141: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/en/GPU_COMPATIBILITY.md` around lines 141 - 149, The fenced code block
showing the "BOUNDARY ANALYSIS" table lacks a language specifier; update the
opening fence that precedes the BOUNDARY ANALYSIS block from ``` to ```text so
the block is explicitly marked as plain text (look for the lines containing the
literal "BOUNDARY ANALYSIS" and the surrounding triple-backtick fence and change
the opening fence accordingly).

Comment on lines 79 to 83
# 8GB GPU (Tier 4) をシミュレート
MAX_CUDA_VRAM=8 uv run acestep

# 12GB GPU (Tier 5) をシミュレート
MAX_CUDA_VRAM=12 uv run acestep

⚠️ Potential issue | 🟡 Minor

Correct tier labels for 8GB/12GB simulation examples.

Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.

📝 Suggested update
-# 8GB GPU (Tier 4) をシミュレート
+# 8GB GPU (Tier 3) をシミュレート

-# 12GB GPU (Tier 5) をシミュレート
+# 12GB GPU (Tier 4) をシミュレート
🤖 Prompt for AI Agents
In `@docs/ja/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the comment labels
for the GPU simulation examples: change the "8GB GPU (Tier 4) をシミュレート" comment
to "8GB GPU (Tier 3) をシミュレート" and change the "12GB GPU (Tier 5) をシミュレート" comment
to "12GB GPU (Tier 4) をシミュレート" so the comments match the tier mapping for the
MAX_CUDA_VRAM examples (the lines using MAX_CUDA_VRAM=8 and MAX_CUDA_VRAM=12
before running "uv run acestep").

Comment on lines 79 to 83
# 8GB GPU 시뮬레이션 (티어 4)
MAX_CUDA_VRAM=8 uv run acestep

# 12GB GPU 시뮬레이션 (티어 5)
MAX_CUDA_VRAM=12 uv run acestep

⚠️ Potential issue | 🟡 Minor

Correct tier labels for 8GB/12GB simulation examples.

Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.

📝 Suggested update
-# 8GB GPU 시뮬레이션 (티어 4)
+# 8GB GPU 시뮬레이션 (티어 3)

-# 12GB GPU 시뮬레이션 (티어 5)
+# 12GB GPU 시뮬레이션 (티어 4)
🤖 Prompt for AI Agents
In `@docs/ko/GPU_COMPATIBILITY.md` around lines 79 - 83, The tier labels are
incorrect for the 8GB/12GB examples; update the headings for the examples shown
(the commented lines above the commands using MAX_CUDA_VRAM and uv run acestep)
so the 8GB example reads "8GB GPU 시뮬레이션 (티어 3)" and the 12GB example reads "12GB
GPU 시뮬레이션 (티어 4)" to match the mapping ≤8GB → tier3 and ≤12GB → tier4.

Comment on lines +165 to +166
# 测试所有等级 (4, 6, 8, 12, 16, 20, 24 GB)
python profile_inference.py --mode tier-test

⚠️ Potential issue | 🟡 Minor

Align the “test all tiers” list with actual tier-test defaults.

The default tier list in code includes 48GB and does not include 20GB; the doc should match or explicitly note that 20GB is only included via --tiers.

📝 Suggested update
-# 测试所有等级 (4, 6, 8, 12, 16, 20, 24 GB)
+# 测试所有等级 (4, 6, 8, 12, 16, 24, 48 GB)
🤖 Prompt for AI Agents
In `@docs/zh/BENCHMARK.md` around lines 165 - 166, Update the "测试所有等级" section so
the listed tiers match the actual default tiers used by profile_inference.py;
replace the current list (which shows 20GB and omits 48GB) with the real
defaults (include 48GB and remove 20GB), or add a short note stating that 20GB
is not a default and must be provided via the --tiers flag; refer to the "python
profile_inference.py --mode tier-test" invocation and the default tier list in
the script when making the change.

Comment on lines 79 to 83
# 模拟 8GB GPU (Tier 4)
MAX_CUDA_VRAM=8 uv run acestep

# 模拟 12GB GPU (Tier 5)
MAX_CUDA_VRAM=12 uv run acestep

⚠️ Potential issue | 🟡 Minor

Correct tier labels for 8GB/12GB simulation examples.

Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.

📝 Suggested update
-# 模拟 8GB GPU (Tier 4)
+# 模拟 8GB GPU (Tier 3)

-# 模拟 12GB GPU (Tier 5)
+# 模拟 12GB GPU (Tier 4)
🤖 Prompt for AI Agents
In `@docs/zh/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the tier labels in
the examples: change the heading "模拟 8GB GPU (Tier 4)" to "模拟 8GB GPU (Tier 3)"
and change "模拟 12GB GPU (Tier 5)" to "模拟 12GB GPU (Tier 4)"; keep the example
commands (MAX_CUDA_VRAM=8 uv run acestep and MAX_CUDA_VRAM=12 uv run acestep)
unchanged and ensure the documentation reflects the mapping ≤8GB → tier3 and
≤12GB → tier4.

Comment on lines +141 to +149
```
BOUNDARY ANALYSIS
=================
  Capability                                    Min Tier   VRAM
  ------------------------------------------------------------
  No INT8 Quantization                          tier6b     20GB
  No CPU Offload (all models on GPU)            tier6b     20GB
  ------------------------------------------------------------
```

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the boundary-analysis output block.

📝 Suggested update
-```
+```text
 BOUNDARY ANALYSIS
 =================
   Capability                                    Min Tier   VRAM
🤖 Prompt for AI Agents
In `@docs/zh/GPU_COMPATIBILITY.md` around lines 141 - 149, The fenced code block
that starts with the "BOUNDARY ANALYSIS" header should include a language
specifier to ensure proper rendering; update the opening triple-backtick for the
block containing "BOUNDARY ANALYSIS" and the table (the block that currently
begins with ``` and the header line "BOUNDARY ANALYSIS") to use a language tag
(e.g., change ``` to ```text) so the table renders as plain text in
documentation.

Comment on lines +789 to +798
    # Find LM model on disk
    lm_model = None
    lm_backend = gpu_config.recommended_backend
    if use_lm:
        lm_model = find_best_lm_model_on_disk(
            gpu_config.recommended_lm_model, disk_lm_models
        )
        if not lm_model:
            print(f"  ⚠️ No compatible LM model on disk for tier {tier}, skipping LM")
            use_lm = False

⚠️ Potential issue | 🟠 Major

Filter disk LM models by tier-compatible sizes before selection.

Without size filtering, a larger disk-only LM can be chosen and cause false failures on low tiers.

🛠️ Suggested fix
     if use_lm:
-        lm_model = find_best_lm_model_on_disk(
-            gpu_config.recommended_lm_model, disk_lm_models
-        )
+        compatible_models = [
+            m for m in disk_lm_models
+            if is_lm_model_size_allowed(m, gpu_config.available_lm_models)
+        ]
+        lm_model = find_best_lm_model_on_disk(
+            gpu_config.recommended_lm_model, compatible_models
+        )
         if not lm_model:
             print(f"  ⚠️ No compatible LM model on disk for tier {tier}, skipping LM")
             use_lm = False
🤖 Prompt for AI Agents
In `@profile_inference.py` around lines 789 - 798, The code may pick a disk-only
LM that is too large for the current tier because disk_lm_models are not
filtered by tier size; before calling find_best_lm_model_on_disk, filter
disk_lm_models to only include models whose size is compatible with the current
tier (use the tier variable and gpu_config-recommended sizing rules) and then
pass that filtered list to find_best_lm_model_on_disk (keep references to
lm_model, use_lm, find_best_lm_model_on_disk, disk_lm_models,
gpu_config.recommended_lm_model and tier so the change is easy to locate).

Comment on lines +815 to +823
    if torch.cuda.is_available():
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        total_gb = total_bytes / (1024 ** 3)
        if sim_gb < total_gb:
            reference_context_gb = 0.5
            allocator_budget_gb = max(0.5, sim_gb - reference_context_gb)
            fraction = max(0.01, min(1.0, allocator_budget_gb / total_gb))
            torch.cuda.set_per_process_memory_fraction(fraction)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "profile_inference.py" -type f

Repository: ace-step/ACE-Step-1.5

Length of output: 87


🏁 Script executed:

# Get file size first
wc -l profile_inference.py 2>/dev/null || echo "File not found, searching..."
find . -name "*.py" -path "*profile*" -type f

Repository: ace-step/ACE-Step-1.5

Length of output: 139


🏁 Script executed:

# Read the specific lines with broader context (lines 800-835)
sed -n '800,835p' profile_inference.py | cat -n

Repository: ace-step/ACE-Step-1.5

Length of output: 1767


🏁 Script executed:

# Search for tier-test related code and loop structures
rg -n "tier" profile_inference.py -i -A 2 -B 2

Repository: ace-step/ACE-Step-1.5

Length of output: 22119


🏁 Script executed:

# Look for where this function is called
rg -n "torch.cuda.set_per_process_memory_fraction" . --type py -B 5 -A 5

Repository: ace-step/ACE-Step-1.5

Length of output: 2768


🏁 Script executed:

# Check PyTorch documentation or source for set_per_process_memory_fraction behavior
rg -n "set_per_process_memory_fraction\|memory_fraction" . --type py -B 2 -A 2

Repository: ace-step/ACE-Step-1.5

Length of output: 47


🏁 Script executed:

# Look at the loop structure more carefully - check if tiers are tested in order and if multiple variants are called per tier
sed -n '1075,1190p' profile_inference.py | cat -n

Repository: ace-step/ACE-Step-1.5

Length of output: 6728


🏁 Script executed:

# Check if there's any initialization or reset code before/after the tier loop
sed -n '1019,1100p' profile_inference.py | cat -n

Repository: ace-step/ACE-Step-1.5

Length of output: 4212


🏁 Script executed:

# Search for any documentation or comments about the memory fraction behavior
rg -n "memory.fraction\|cuda.*fraction\|1\.0\|reset" profile_inference.py -i | head -30

Repository: ace-step/ACE-Step-1.5

Length of output: 47


🏁 Script executed:

# Check if there's any documentation or code comments about memory fraction in the codebase
rg -n "1\.0.*fraction\|fraction.*1\.0\|unrestricted\|reset.*cuda\|cuda.*reset" . --type py -i

Repository: ace-step/ACE-Step-1.5

Length of output: 47


🏁 Script executed:

# Look for any tests or examples of set_per_process_memory_fraction to understand typical usage
rg -n "set_per_process_memory_fraction" . --type py -B 3 -A 3

Repository: ace-step/ACE-Step-1.5

Length of output: 3654


Reset CUDA memory fraction to 1.0 when sim_gb ≥ physical GPU VRAM.

Within each tier, multiple test variants (default, no-quant, no-offload, batch-N) are run sequentially. A smaller fraction set by an earlier variant persists and caps later variants, causing false failures or skewed VRAM measurements. Always reset to 1.0 when the simulated VRAM is at or above physical capacity to clear the prior cap.

🛠️ Suggested fix
         if sim_gb < total_gb:
             reference_context_gb = 0.5
             allocator_budget_gb = max(0.5, sim_gb - reference_context_gb)
             fraction = max(0.01, min(1.0, allocator_budget_gb / total_gb))
             torch.cuda.set_per_process_memory_fraction(fraction)
+        else:
+            # Ensure we don't keep a tighter cap from a previous tier or variant
+            torch.cuda.set_per_process_memory_fraction(1.0)
🤖 Prompt for AI Agents
In `@profile_inference.py` around lines 815 - 823, The current CUDA
memory-fraction logic only sets a reduced fraction when sim_gb < physical VRAM,
but doesn't reset the per-process cap when sim_gb >= physical VRAM, leaving a
prior smaller cap in place; update the block in profile_inference.py that checks
torch.cuda.is_available() (the code using torch.cuda.get_device_properties,
total_gb, sim_gb and torch.cuda.set_per_process_memory_fraction) so that when
sim_gb >= total_gb you explicitly call
torch.cuda.set_per_process_memory_fraction(1.0) to clear any previous cap;
retain the existing reduced-fraction calculation path for sim_gb < total_gb.

8GB GPUs with 0.6B LM + offload have enough headroom for batch=2.
DiT(4.46) + context(0.5) ≈ 5.0GB leaves ~3GB free, sufficient for
2 samples of DiT activations (~0.8GB each at 60s).

Updated gpu_config.py and docs (en/zh/ja).
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@acestep/handler.py`:
- Around line 1605-1607: The device check currently compares the device variable
to strings and can miss non-CUDA devices (e.g., torch.device objects like
"xpu"), causing CUDA-only VRAM logic to run incorrectly; update the check in the
handler where device is read (the block using device = self.device and the
similar block at the later occurrence) to reliably detect CUDA by converting to
string or using device.type (e.g., if str(device) in ("cpu","mps") or if
getattr(device,"type",str(device)) in ("cpu","mps")), and keep the subsequent
call to get_effective_free_vram_gb() only for CUDA devices so batch_size isn't
forced to 1 on non-CUDA devices.
🧹 Nitpick comments (1)
acestep/handler.py (1)

1583-1586: use_lm is currently unused in the guard.

Either drop it or use it to pick the appropriate tier limit so callers can control whether LM-specific caps apply.

♻️ Possible refinement
-                tier_max = gpu_config.max_batch_size_with_lm
+                tier_max = (
+                    gpu_config.max_batch_size_with_lm
+                    if use_lm
+                    else gpu_config.max_batch_size_without_lm
+                )

Also applies to: 1615-1622

Comment on lines +1605 to +1607
        device = self.device
        if device == "cpu" or device == "mps":
            return batch_size  # No CUDA VRAM to guard

⚠️ Potential issue | 🟡 Minor

Skip CUDA-only free-VRAM checks on non-CUDA devices to avoid forced batch=1.

The current guard uses string equality comparison on what is likely a torch.device object, which fails for non-CUDA devices. Devices like "xpu" bypass this check and proceed to call get_effective_free_vram_gb(), which may report 0 VRAM and collapse batch size. Convert the device to string before comparison or use .type attribute to reliably detect CUDA devices.

🛠️ Suggested fix
-        device = self.device
-        if device == "cpu" or device == "mps":
-            return batch_size  # No CUDA VRAM to guard
+        device = self.device
+        device_str = str(device)
+        is_cuda = device_str == "cuda" or device_str.startswith("cuda")
+        if not is_cuda:
+            return batch_size  # No CUDA VRAM to guard

Also applies to: 1631-1633

🤖 Prompt for AI Agents
In `@acestep/handler.py` around lines 1605 - 1607, The device check currently
compares the device variable to strings and can miss non-CUDA devices (e.g.,
torch.device objects like "xpu"), causing CUDA-only VRAM logic to run
incorrectly; update the check in the handler where device is read (the block
using device = self.device and the similar block at the later occurrence) to
reliably detect CUDA by converting to string or using device.type (e.g., if
str(device) in ("cpu","mps") or if getattr(device,"type",str(device)) in
("cpu","mps")), and keep the subsequent call to get_effective_free_vram_gb()
only for CUDA devices so batch_size isn't forced to 1 on non-CUDA devices.

chuxij and others added 6 commits February 10, 2026 13:30
When batch_size > 1, VAE decode VRAM scales linearly with batch size.
On 8GB GPUs (tier3, batch=2), decoding 2 samples at once exceeds VRAM.

Fix: In _tiled_decode_inner, when B > 1, decode each sample individually
and move results to CPU immediately after each decode. This keeps peak
VRAM constant regardless of batch size.

Also updated tier3 max_batch_size_with_lm from 1 to 2 (8GB GPUs with
0.6B LM + offload have sufficient headroom for batch=2).
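A sketch of the per-sample decode strategy this commit describes. The vae.decode call and tensor handling here are placeholders; the real change lives in _tiled_decode_inner and may differ in detail.

```python
import torch

def decode_batch_sequentially(vae, latents):
    """Decode one sample at a time and park results on CPU so peak VRAM
    stays roughly constant regardless of batch size."""
    outputs = []
    for i in range(latents.shape[0]):              # iterate over the batch dimension
        with torch.no_grad():
            audio = vae.decode(latents[i:i + 1])   # keep a batch dim of 1
        outputs.append(audio.cpu())                # move off the GPU right away
        del audio
        torch.cuda.empty_cache()
    return torch.cat(outputs, dim=0)
```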
Now that VAE decode is batch-sequential (no extra VRAM per sample),
the bottleneck is only DiT activations which scale modestly.

Updated batch limits:
  - tier5  (12-16GB): with_lm 2→4, without_lm stays 4
  - tier6b (20-24GB): with_lm 4→8, without_lm stays 8

Summary of all tiers (LM / No LM):
  tier1  ≤4GB:   1/1    tier4  8-12GB:  2/4
  tier2  4-6GB:  1/1    tier5  12-16GB: 4/4
  tier3  6-8GB:  2/2    tier6a 16-20GB: 4/8
  tier6b 20-24GB: 8/8   unlimited ≥24GB: 8/8

Updated docs (en/zh/ja).
- test_time_scaling.py: Add _load_scoring_model_context() that moves the
  HF scoring model to GPU only during forward pass and offloads back to
  CPU afterwards (for vllm/mlx backends). Move output logits to CPU to
  avoid keeping large vocab tensors on GPU.

- llm_inference.py: When offload_to_cpu=True, keep the HF scoring model
  on CPU after initial loading (vllm/mlx backends). The context manager
  in test_time_scaling.py handles GPU placement on demand.

- dit_alignment_score.py: Force MusicLyricScorer.calculate_score() to
  always compute on CPU. The scoring matrices are small and do not
  benefit from GPU acceleration, while occupying VRAM that DiT/VAE/LM
  need on low-VRAM GPUs.
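A minimal sketch of the offload context manager described for test_time_scaling.py above. The name scoring_model_on_gpu and the usage lines are illustrative assumptions, not the project's actual API.

```python
from contextlib import contextmanager
import torch

@contextmanager
def scoring_model_on_gpu(model, device="cuda"):
    """Keep the HF scoring model on CPU and borrow the GPU only for a forward pass."""
    model.to(device)
    try:
        yield model
    finally:
        model.to("cpu")           # offload weights back to CPU
        torch.cuda.empty_cache()  # release cached allocator blocks

# Usage sketch: keep the large vocab logits off the GPU as well.
# with scoring_model_on_gpu(scorer) as m:
#     logits = m(**inputs).logits.cpu()
```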
Keep _vram_guard_reduce_batch (our feature). Remove
_start_diffusion_progress_estimator (now provided by ProgressMixin
from acestep/core/generation/handler/progress.py).